Libraries used across the project:
rm(list = ls())
library(tidyverse)  # attaches dplyr, tidyr, ggplot2 and readr, so they are not loaded again below
library(eurostat)
library(leaflet)
library(sf)
library(scales)
library(cowplot)
library(ggthemes)
library(plotly)
library(factoextra)
library(ggsci)
library(wesanderson)
library(readxl)
library(ConvergenceClubs)
library(haven)
library(ggrepel)
library(isoband)
library(mFilter)
tgs00003 <- get_eurostat("tgs00003", time_format = "num")
## Dataset query already saved in cache_list.json...
## Reading cache file C:\Users\BERNAR~3\AppData\Local\Temp\Rtmp0MXKMQ/eurostat/91b3ee552354bfc42aefc1d221152ed7.rds
## Table tgs00003 read from cache file: C:\Users\BERNAR~3\AppData\Local\Temp\Rtmp0MXKMQ/eurostat/91b3ee552354bfc42aefc1d221152ed7.rds
There are two main reasons to analyze the existence of clusters in terms of any statistical variable. The first is that analyzing these clusters allows us to study the evolution of each country compared to the rest of the European Union and to evaluate whether sustained and equal growth has been achieved in the EU. The second is that clustering the countries or regions of the European Union allows aid to be focused on those most in need of assistance (Saba, 2021), while also providing criteria to quantify how much help each cluster would need.
To that end, this project has the objective of clustering NUTS 2 regions in terms of regional gross domestic product (in million EUR). Throughout the project several clustering algorithms will be used, such as the well-known KMeans and the Convergence Clubs algorithm, as a way to explain the advantages and disadvantages of KMeans and to compare both types of algorithm for this kind of objective.
# Eurostat dataset tgs00003: "Regional gross domestic product by NUTS 2 regions - million EUR"
As we are working with longitudinal data, the descriptive analysis is straightforward because we can plot the data visually. However, since we are working with NUTS 2 regions (approximately 334 regions), a line graph showing the evolution of the values across the years would be hard to read; if we were working only with countries we would not have this "problem". For that reason, the best way to visualize and describe the data is to plot the map of the NUTS 2 regions.
Before plotting the data we convert it to a wide format (so that summary() can be computed per year rather than over all values at once):
tgs00003_wide <- tgs00003 |>
pivot_wider(names_from = TIME_PERIOD, values_from = values)
summary(tgs00003_wide)
## freq unit geo 2011
## Length:307 Length:307 Length:307 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 10348
## Mode :character Mode :character Mode :character Median : 23076
## Mean : 41894
## 3rd Qu.: 52528
## Max. :615497
##
## 2012 2013 2014 2015
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 10182 1st Qu.: 10455 1st Qu.: 11004 1st Qu.: 11634
## Median : 23760 Median : 23966 Median : 24393 Median : 25243
## Mean : 42563 Mean : 43065 Mean : 43888 Mean : 45690
## 3rd Qu.: 53974 3rd Qu.: 54508 3rd Qu.: 56257 3rd Qu.: 57436
## Max. :630605 Max. :644681 Max. :653969 Max. :670619
##
## 2016 2017 2018 2019
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 11951 1st Qu.: 12621 1st Qu.: 13030 1st Qu.: 13457
## Median : 25642 Median : 27136 Median : 27218 Median : 29541
## Mean : 46750 Mean : 48448 Mean : 49679 Mean : 51415
## 3rd Qu.: 58669 3rd Qu.: 60653 3rd Qu.: 62071 3rd Qu.: 63683
## Max. :685853 Max. :705343 Max. :729516 Max. :759582
##
## 2020 2021 2022
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 12847 1st Qu.: 13740 1st Qu.: 15378
## Median : 28652 Median : 31756 Median : 34952
## Mean : 49337 Mean : 53838 Mean : 58294
## 3rd Qu.: 61947 3rd Qu.: 68581 3rd Qu.: 71873
## Max. :701018 Max. :757630 Max. :782639
## NA's :18
As we can observe in the summary, there may be some "region" with value 0, and 2022 has 18 missing values, so we will only consider the years 2011 to 2021.
According to the Nomenclature of territorial units for statistics - NUTS 2016/EU-28, HUZZ is an "Extra-regio NUTS 2" territory, so we will delete that region together with the year 2022.
tgs00003_wide <- tgs00003_wide |>
select(-"2022") |>
filter( geo != "HUZZ")
summary(tgs00003_wide)
## freq unit geo 2011
## Length:306 Length:306 Length:306 Min. : 14
## Class :character Class :character Class :character 1st Qu.: 10443
## Mode :character Mode :character Mode :character Median : 23094
## Mean : 42030
## 3rd Qu.: 52551
## Max. :615497
## 2012 2013 2014 2015
## Min. : 14.3 Min. : 14.9 Min. : 15.4 Min. : 15.5
## 1st Qu.: 10286.5 1st Qu.: 10555.1 1st Qu.: 11036.4 1st Qu.: 11773.7
## Median : 23823.3 Median : 23965.8 Median : 24448.9 Median : 25423.6
## Mean : 42702.1 Mean : 43205.9 Mean : 44031.0 Mean : 45839.3
## 3rd Qu.: 53976.2 3rd Qu.: 54778.8 3rd Qu.: 56287.6 3rd Qu.: 57455.6
## Max. :630604.7 Max. :644681.1 Max. :653969.0 Max. :670619.2
## 2016 2017 2018 2019
## Min. : 16.1 Min. : 17.2 Min. : 19 Min. : 20.3
## 1st Qu.: 12009.5 1st Qu.: 12806.6 1st Qu.: 13043 1st Qu.: 13606.6
## Median : 25697.0 Median : 27394.4 Median : 27933 Median : 29609.4
## Mean : 46902.6 Mean : 48605.8 Mean : 49841 Mean : 51582.8
## 3rd Qu.: 58778.3 3rd Qu.: 60780.9 3rd Qu.: 62087 3rd Qu.: 63762.5
## Max. :685852.6 Max. :705342.6 Max. :729516 Max. :759581.8
## 2020 2021
## Min. : 14.6 Min. : 22.8
## 1st Qu.: 12959.4 1st Qu.: 13816.5
## Median : 28742.5 Median : 31934.9
## Mean : 49498.5 Mean : 54014.3
## 3rd Qu.: 62204.4 3rd Qu.: 68735.6
## Max. :701018.1 Max. :757630.1
If we wanted to impute the NA values we could fit an auto.arima model from library(forecast), but since the 2022 values are still flagged as "provisional" in the Eurostat database and may change in the near future, there is no reason to include them in our analysis.
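For illustration only (we drop 2022 instead), such an imputation could be sketched per region with forecast::auto.arima; the helper name and the one-step horizon are our own assumptions:

```r
# Sketch: fill the missing 2022 values with a one-step-ahead ARIMA
# forecast fitted to each region's 2011-2021 series (illustrative only).
library(forecast)

impute_2022 <- function(wide_df) {
  year_cols <- as.character(2011:2021)
  wide_df$`2022` <- apply(wide_df[, year_cols], 1, function(row) {
    fit <- auto.arima(ts(as.numeric(row), start = 2011))
    as.numeric(forecast(fit, h = 1)$mean)  # one-step-ahead point forecast
  })
  wide_df
}
```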
What we observe is a large difference between the lowest and the highest value in every year; we also see that the mean is greater than the median, showing that our data are right-skewed.
tgs_longvalues <- tgs00003_wide[, -c(1:3)]
dataplot <- pivot_longer(tgs_longvalues, cols = `2011`:`2021`, names_to = "Year", values_to = "Value")
plot_ly(dataplot, x = ~Value, type = 'histogram', color = ~factor(Year)) |>
layout(barmode = 'stack', title = 'Interactive Histogram for each Year', xaxis = list(title = 'Values'), yaxis = list(title = 'Frequency'))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
What we can observe in this interactive plot (values are stacked) is that some regions have very high values (capital-city regions or the main region of each country). This is one of the main issues we will face throughout the analysis, although we can scale the data to mitigate it, since KMeans is very sensitive to "outliers", high values, and disparity across the data. Some NUTS 2 regions have much higher values than the rest, Paris being the highest across all years, among other capital cities/regions that distort the KMeans results.
SHP_2_3035 <- get_eurostat_geospatial(
resolution = 10,
nuts_level = 2,
year = 2021,
crs = 3035)
## Loading required namespace: giscoR
## Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
tgs00003_shp <- tgs00003 |>
filter(TIME_PERIOD == 2019) |>
select(geo, values) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
tgs00003_shp |>
ggplot(aes(fill = values)) +
geom_sf(
size = 0.1,
color = "#333333"
) +
scale_fill_distiller(
palette = "YlGnBu",
direction = 1,
name = "EUR Millions ",
breaks = pretty_breaks(10),
na.value = "gray80",
guide = guide_colorbar(
direction = "vertical",
title.position = "top",
label.position = "right",
barwidth = unit(0.4, "cm"),
barheight = unit(6, "cm"),
ticks = TRUE,
)
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = "by NUTS 2 regions - million EUR",
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
SHP_2_3035 <- get_eurostat_geospatial(
resolution = 10,
nuts_level = 2,
year = 2021,
crs = 3035)
## Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
tgs00003_shp <- tgs00003 |>
filter(TIME_PERIOD == 2021) |> # year filter added: joining every year at once over-plots each geometry
select(geo, values) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
tgs00003_shp |>
ggplot(aes(fill = values)) +
geom_sf(
size = 0.1,
color = "#333333"
) +
scale_fill_distiller(
palette = "YlGnBu",
direction = 1,
name = "EUR Millions ",
breaks = pretty_breaks(10),
na.value = "gray80",
guide = guide_colorbar(
direction = "vertical",
title.position = "top",
label.position = "right",
barwidth = unit(0.4, "cm"),
barheight = unit(6, "cm"),
ticks = TRUE,
)
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = "by NUTS 2 regions - million EUR",
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
plot_europe <- function(year) {
tgs00003_shp_filtered <- tgs00003 |>
filter(TIME_PERIOD == year) |>
select(geo, values) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
ggplot(tgs00003_shp_filtered, aes(fill = values)) +
geom_sf(
size = 0.1,
color = "#333333"
) +
scale_fill_distiller(
palette = "YlGnBu",
direction = 1,
name = "EUR Millions ",
breaks = pretty_breaks(10),
na.value = "gray80",
guide = guide_colorbar(
direction = "vertical",
title.position = "top",
label.position = "right",
barwidth = unit(0.4, "cm"),
barheight = unit(6, "cm"),
ticks = TRUE
),
labels = scales::label_number(accuracy = 1) # Format labels without scaling
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = paste("by NUTS 2 regions - million EUR, Year:", year),
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
}
plot_europe(2015)
plot_europe(2017)
The next step of the project is to perform a PCA and interpret the information we obtain.
X <- tgs00003_wide |> select("2011":"2021")
regions <- tgs00003_wide$geo
pca = prcomp(X, scale=T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3092 0.18981 0.08580 0.05250 0.03782 0.02788 0.01654
## Proportion of Variance 0.9956 0.00328 0.00067 0.00025 0.00013 0.00007 0.00002
## Cumulative Proportion 0.9956 0.99882 0.99949 0.99974 0.99987 0.99994 0.99997
## PC8 PC9 PC10 PC11
## Standard deviation 0.01348 0.01032 0.007425 0.005667
## Proportion of Variance 0.00002 0.00001 0.000010 0.000000
## Cumulative Proportion 0.99998 0.99999 1.000000 1.000000
fviz_screeplot(pca, addlabels = TRUE)
As we can observe in the results, a single component is enough to explain nearly all the variability in our data. This is a consequence of having longitudinal data, that is to say, one variable measured across the years. For that reason, if we wanted to reduce the number of columns or run certain types of clustering or algorithms such as a beta-convergence analysis, it would be preferable to take the base value and its yearly growth. Likewise, if we were clustering countries (a small sample such as the EU countries) according to a single variable, using the base value and growth would be recommended to keep a suitable proportion between variables and "rows".
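That base-value-plus-growth transformation can be sketched as follows (column names taken from the wide table above; the feature names are our own):

```r
# Reduce each region's 2011-2021 series to two features:
# the base-year value and the compound annual growth rate (CAGR).
library(dplyr)

features <- tgs00003_wide |>
  mutate(
    base_2011 = `2011`,
    cagr = (`2021` / `2011`)^(1 / 10) - 1  # 10 yearly steps between 2011 and 2021
  ) |>
  select(geo, base_2011, cagr)
```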
For principal components applied to longitudinal data there are more advanced methodologies that are recommended, such as Functional PCA.
barplot(pca$rotation[,1], las=2, col="darkblue")
As we can observe in this plot, and in line with our previous hypothesis, each year has the same importance in the first component. If we were working with another type of data, the analysis of each PC would be different, as there would be an interpretation of which variables account for the greatest share of the variability in the data.
To summarize, the objective of PCA is to reduce the number of variables in a data set while maintaining the patterns and trends in the data. This is done by creating new variables that are linear combinations of the original ones, built from the eigenvectors and eigenvalues of the covariance (or correlation) matrix.
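That construction can be checked by hand against prcomp (X as defined earlier; since we standardized, the correlation matrix is the relevant one):

```r
# PCA via the eigendecomposition of the correlation matrix;
# the standard deviations should match pca$sdev from prcomp(X, scale = T).
eig <- eigen(cor(X))
scores <- scale(X) %*% eig$vectors  # principal component scores
sdev <- sqrt(eig$values)            # component standard deviations
```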
fviz_contrib(pca, choice = "var", axes = 1)
In this case we observe that the year with the "highest" contribution is 2017, although all contributions are nearly identical.
The red dashed line on the graph above indicates the expected average contribution: if the contributions were uniform, the expected value would be 1/length(variables) = 1/11 ≈ 9%.
regions[order(pca$x[,1])][1:10]
## [1] "MTZZ" "RSZZ" "LVZZ" "FIZZ" "SEZZ" "ROZZ" "ATZZ" "PTZZ" "NO0B" "BEZZ"
regions[order(pca$x[,1], decreasing=T)][1:10]
## [1] "FR10" "ITC4" "DE21" "FRK2" "TR10" "ES30" "ES51" "DEA1" "DE11" "DE71"
Looking at these orderings, the second list of 10 regions corresponds to the "richest" regions in our original data set (led by FR10, Paris), while the first is dominated by "ZZ" extra-regio codes; PC1 is essentially capturing the overall level of each region's series.
For this type of data, PCA does not seem capable of adding information, since clustering according to a single component would not make much sense. It could be interesting, however, to "add" components derived from other longitudinal variables: if this component accounts for GDP per region from 2011 to 2021, we could add other variables such as the unemployment rate per region over 2011 to 2021, obtain another component explaining most of that variability, and thereby form a new data set that takes several different variables into account across the same set of years. Alternatively, we could achieve something similar by taking the base year and growth for each variable, rather than one component per subset of data, before performing clustering or any other type of analysis.
Technically, we should perform a Functional principal component analysis (FPCA), which is a "powerful tool for modeling longitudinal data observed at various time points. FPCA aims to decompose the latent stochastic process into a linear combination of functional principal components (FPCs), which maximize the variation in the randomly observed curves" (Shi et al., 2020). https://www.sfu.ca/~lwa68/publications/StatisticsInMedicine-2020-Shi-FPCA4LongitudinalDataWithInformative.pdf
That is to say, in this case PCA does not give us any further insight. PCA is usually applied to simplify high-dimensional data for visualization purposes or to reduce computing costs; here we are better off taking the base value and its growth if we want to represent the data in two dimensions, although we can also work directly with the time series since we are only considering one variable.
If we were working with, for example, client data (from a business perspective), PCA could be used to analyze which variables explain most of the variability, to represent the data in two dimensions, and to reduce computing costs when a machine learning algorithm is applied. For instance, imagine we have data for each client of a hypothetical business: money spent, purchase frequency, recency, and age, among other variables. We could use PCA to analyze which variables explain most of the variability, to visualize the different types of clients in a plot (after some kind of clustering), or to speed up a computationally expensive algorithm such as XGBoost, since with many clients reducing the number of variables can help when processing the data.
There are several algorithms for clustering, but the standard one is the Hartigan-Wong algorithm, in which the total variance of the individuals within a cluster is defined as the sum of the squared Euclidean distances between the elements and the corresponding centroid. The centroid of each cluster is its center, which corresponds to the mean of the individuals in that cluster (Hartigan and Wong, 1979).
The algorithm proceeds as follows:
1. k centroids are placed at random positions in the data as initial centroids, and each individual \(x_i\) is assigned to the nearest centroid using the Euclidean distance.
2. The average value of each cluster is computed and becomes the new centroid \(\mu_k\), and the individuals \(x_i\) are reassigned to the new centroids.
3. The previous step is repeated until the centroids no longer change, so that the total variation of individuals within clusters is the minimum possible.
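A minimal sketch of this iterative scheme (Lloyd's variant, written from the steps above; stats::kmeans itself uses Hartigan-Wong by default, and the empty-cluster edge case is not handled here):

```r
# Toy implementation of the assign/update loop described above.
lloyd_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  for (i in seq_len(max_iter)) {
    # 1. assign each point to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cl <- max.col(-d)
    # 2. recompute each centroid as the mean of its cluster
    new_centers <- apply(X, 2, function(col) tapply(col, cl, mean))
    # 3. stop when the centroids no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}
```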
There are several ways to analyze the optimal number of centroids or clusters, such as the elbow method and the silhouette method.
The elbow method uses the mean distance of the observations to their respective centroid, i.e., the total variance of the individuals within each cluster. The more clusters, the lower this variance (in the limit, the number of clusters equals the number of observations), so the optimal number of clusters is the one beyond which increasing the number does not substantially improve the within-cluster variance.
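The elbow curve can be drawn with factoextra by switching the criterion to the within-cluster sum of squares:

```r
# Elbow method: within-cluster sum of squares for k = 1..10
library(factoextra)
fviz_nbclust(scale(X), kmeans, method = "wss", k.max = 10)
```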
Silhouette analysis is used to assess the quality of a clustering. It measures the separation between clusters, i.e., how close each observation in a cluster is to the observations of the neighboring clusters. The silhouette coefficient ranges from -1 to 1: the closer the value is to 1, the farther the observation is from the neighboring clusters; a value of 0 means it lies very near or on the border between two clusters; a negative value indicates that it is probably in the wrong cluster.
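The silhouette widths of a given partition can also be computed directly with the cluster package (the choice of k = 2 here is just for illustration):

```r
# Average silhouette width for a KMeans partition with k = 2
library(cluster)
km2 <- kmeans(scale(X), 2, nstart = 25)
sil <- silhouette(km2$cluster, dist(scale(X)))
summary(sil)$avg.width
```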
There are several types of algorithms to decide how many clusters there should be in the data; the best option is to run as many of them as possible before deciding on the optimal number, although in a business setting we may want a specific number of clusters for other reasons.
library(factoextra)
fviz_nbclust(X, kmeans) # silhouette method (the default)
In this case according to the silhouette method the optimal number of clusters would be 2.
There are also libraries that let us apply many different methods for determining the optimal number of clusters at once.
library(NbClust)
NbClust(X, min.nc = 2, max.nc = 10, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 9 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW
## 2 2.7768 358.2969 185.5795 -2.7198 484.5643 8.542044e+101 1.449886e+22
## 3 0.8405 380.0467 475.9977 -32.3881 737.0411 8.421962e+101 8.551194e+21
## 4 5.2074 807.3797 132.7100 -39.7567 1222.6193 3.062863e+101 1.021310e+21
## 5 1.2984 901.8106 139.9396 -51.4403 1344.8300 3.209952e+101 4.731747e+20
## 6 3.1562 1081.2449 55.5157 -59.7957 1556.5497 2.314063e+101 4.069812e+20
## 7 0.7994 1073.4391 79.4519 -56.9414 1592.9990 2.796002e+101 4.242351e+20
## 8 4.1492 1171.9995 24.7593 -52.0379 1805.1891 1.825441e+101 2.153269e+20
## 9 4.8495 1110.0602 9.5457 -51.2874 1828.1802 2.143100e+101 1.955834e+20
## 10 0.5370 1016.0617 12.3850 -51.7384 1844.8363 2.505637e+101 1.868183e+20
## TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
## 2 6.074664e+12 78149.3 3.4164 0.0481 0.9009 0.7344 0.7648 90.0878
## 3 3.772008e+12 93959.8 5.5019 0.0406 0.8852 0.6617 0.6959 93.5264
## 4 1.467165e+12 122047.8 14.1451 0.0762 0.5149 0.6492 0.6765 98.0396
## 5 1.019263e+12 122797.0 20.3610 0.0515 0.5620 0.5836 3.2585 -102.5804
## 6 6.957830e+11 126749.8 29.8272 0.0505 0.5863 0.5545 1.2509 -25.4757
## 7 5.871326e+11 127022.8 35.3468 0.0439 0.6054 0.5353 2.7600 -111.5947
## 8 4.638705e+11 132790.7 44.7393 0.0455 0.6628 0.5350 1.4293 -27.0334
## 9 4.282864e+11 132252.0 48.4565 0.0390 0.6725 0.5171 1.8660 -23.6683
## 10 4.149498e+11 132719.9 50.0139 0.0370 0.6801 0.4757 7.7768 -54.8990
## Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert
## 2 2.2806 0.5194 3.037332e+12 0.6477 4.2829 0.0563 0.0119 0
## 3 3.2398 0.4883 1.257336e+12 0.5797 3.3794 0.1186 0.0040 0
## 4 3.5440 0.4714 3.667913e+11 0.5475 3.3845 0.1483 0.0108 0
## 5 -5.1284 0.4295 2.038527e+11 0.4404 2.8994 0.2597 0.0050 0
## 6 -1.4827 0.3973 1.159638e+11 0.3977 3.2117 0.3152 0.0042 0
## 7 -4.6944 0.3695 8.387609e+10 0.3340 0.2292 0.4307 0.0052 0
## 8 -2.2104 0.3472 5.798381e+10 0.3344 2.1223 0.4086 0.0058 0
## 9 -3.4079 0.3278 4.758738e+10 0.3068 8.6238 0.4514 0.0068 0
## 10 -6.3607 0.3112 4.149498e+10 0.2624 1.9770 0.6214 0.0044 0
## SDindex Dindex SDbw
## 2 3e-04 83659.50 1.5666
## 3 3e-04 64316.77 1.9838
## 4 1e-04 50363.08 0.2873
## 5 1e-04 37597.51 0.3003
## 6 1e-04 31640.86 0.3137
## 7 1e-04 26451.32 0.2265
## 8 1e-04 24200.16 0.2259
## 9 1e-04 21757.62 0.2152
## 10 2e-04 20206.41 0.2056
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.8619 46.9443 0.0091
## 3 0.8559 36.0273 0.0002
## 4 0.8526 35.4516 0.0001
## 5 0.8389 28.4117 1.0000
## 6 0.8307 25.8763 1.0000
## 7 0.8037 42.7438 1.0000
## 8 0.8013 22.3132 1.0000
## 9 0.7895 13.6008 1.0000
## 10 0.7600 19.8936 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 4.0000 8.000 4.0000 2.0000 4.0000 4.000000e+00
## Value_Index 5.2074 1171.999 343.2878 -2.7198 485.5782 5.506188e+101
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 4.000000e+00 4.000000e+00 4.00 8.0000 10.000 4.0000
## Value_Index 7.529884e+21 1.856941e+12 28087.98 -5.6754 0.037 0.5149
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 2.0000 5.0000 5.0000 5.0000 2.0000 3.000000e+00
## Value_Index 0.7344 3.2585 -102.5804 -5.1284 0.5194 1.779996e+12
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 6.0000 2.0000 2.0000 0 4e+00 0 10.0000
## Value_Index 0.6477 3.2117 0.0563 0.0119 0 1e-04 0 0.2056
##
## $Best.partition
## [1] 1 1 1 1 3 3 1 1 3 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [38] 1 1 3 1 1 1 1 1 1 1 2 3 3 3 2 1 1 1 3 1 3 3 3 1 3 2 1 1 1 3 3 1 3 2 2 3 3
## [75] 3 1 1 3 1 1 1 1 3 3 3 3 1 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1
## [112] 1 2 3 1 1 2 3 1 2 1 1 1 1 1 1 3 1 1 1 1 4 3 1 1 1 3 3 1 3 1 3 3 3 3 1 1 3
## [149] 3 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 1 1 2 1 1 3 3 1 1 3
## [186] 1 1 1 2 1 2 3 1 1 2 1 1 1 3 1 1 1 1 1 1 1 1 1 1 3 1 3 2 2 1 3 1 1 1 1 1 3
## [223] 1 3 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 2 3 1 3 3 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1
Following the majority rule (that is to say, taking the most common answer among all the indices computed), we should use 4 clusters.
As explained before, our data contain "outliers" which are not really outliers: they are real regions, and since capital cities/regions will be richer than the rest we cannot simply "delete" them. For that reason we will scale the data before running the KMeans algorithm.
kmeans <- kmeans(scale(X), 4) # kmeans(data, number of clusters)
print(kmeans)
## K-means clustering with 4 clusters of sizes 70, 1, 19, 216
##
## Cluster means:
## 2011 2012 2013 2014 2015 2016 2017
## 1 0.5531622 0.5551541 0.5518595 0.5537592 0.5602461 0.5552855 0.5538666
## 2 10.1909225 10.2987871 10.3851878 10.3383390 10.2202363 10.1894438 10.1618972
## 3 2.3333886 2.3197810 2.3158040 2.3247456 2.3374197 2.3516821 2.3645584
## 4 -0.4316975 -0.4316454 -0.4306279 -0.4318132 -0.4344835 -0.4339879 -0.4345332
## 2018 2019 2020 2021
## 1 0.5527127 0.5517257 0.5742257 0.5774651
## 2 10.2336836 10.3156197 10.0460605 9.9932850
## 3 2.3562211 2.3418801 2.3513070 2.3614136
## 4 -0.4337582 -0.4325562 -0.4394291 -0.4411236
##
## Clustering vector:
## [1] 4 4 4 4 1 1 4 4 1 4 4 4 4 1 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1
## [38] 4 4 4 4 4 4 4 4 4 4 3 1 1 1 3 4 4 4 1 4 1 1 1 4 1 3 4 4 4 1 1 4 1 3 3 1 1
## [75] 1 4 4 1 4 4 4 4 1 1 1 1 4 1 1 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 1 4 4
## [112] 4 3 1 4 4 3 1 4 3 4 4 4 4 4 4 1 4 4 4 4 2 1 4 4 4 1 1 4 1 4 1 1 1 1 4 4 1
## [149] 1 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 3 1 4 4 3 4 4 1 1 4 4 1
## [186] 4 4 4 3 4 3 1 4 4 3 4 4 4 1 4 4 4 4 4 4 4 4 4 4 1 4 1 3 3 4 1 4 4 4 4 4 1
## [223] 4 1 4 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4
## [260] 4 4 4 4 4 4 3 1 4 1 1 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4
## [297] 4 4 4 4 4 4 4 4 4 4
##
## Within cluster sum of squares by cluster:
## [1] 102.8933 0.0000 151.1976 118.9206
## (between_SS / total_SS = 88.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
When analyzing the results (the average values of each cluster), it is important to note that scaling makes us lose interpretability in terms of total value, although we can still compare across clusters: for example, cluster 2 corresponds to regions that are really "rich" compared to the others, and cluster 4 to regions with low values across the years.
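The centers can also be mapped back to million EUR by undoing the standardization, using the attributes that scale() stores:

```r
# Un-standardize the cluster centers back to the original units (million EUR)
Xs <- scale(X)
centers_eur <- sweep(kmeans$centers, 2, attr(Xs, "scaled:scale"), `*`)
centers_eur <- sweep(centers_eur, 2, attr(Xs, "scaled:center"), `+`)
round(centers_eur)
```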
Another thing we can do is attach the KMeans cluster label to the original data and then compute the average of each cluster in the following way:
aggregate(X, by=list(cluster=kmeans$cluster), mean)
## cluster 2011 2012 2013 2014 2015 2016 2017
## 1 1 73158.13 74392.88 75167.79 76701.54 80088.07 81722.94 84400.74
## 2 2 615497.37 630604.71 644681.14 653969.01 670619.24 685852.59 705342.60
## 3 3 173335.62 175125.98 177329.53 181185.62 188729.63 194369.67 201421.01
## 4 4 17737.78 18061.78 18265.42 18555.05 19278.60 19688.52 20523.05
## 2018 2019 2020 2021
## 1 86549.79 89449.81 86738.93 94672.93
## 2 729515.61 759581.81 701018.05 757630.11
## 3 206330.57 212314.70 201988.40 220278.72
## 4 21032.90 21894.92 21000.13 22955.26
tgs0003cluster <- cbind(tgs00003_wide, cluster = kmeans$cluster)
head(tgs0003cluster)
## freq unit geo 2011 2012 2013 2014 2015 2016
## 1 A MIO_EUR AL01 2214.58 2297.46 2276.67 2290.45 2418.75 2549.18
## 2 A MIO_EUR AL02 4195.20 4294.60 4319.08 4535.08 4837.09 5050.74
## 3 A MIO_EUR AL03 2858.55 2993.75 3029.61 3143.06 3008.27 3119.93
## 4 A MIO_EUR AT11 7012.58 7365.43 7539.32 7737.13 8040.55 8351.77
## 5 A MIO_EUR AT12 48511.32 49802.60 50470.13 52049.38 53884.36 55570.82
## 6 A MIO_EUR AT13 80747.85 81981.67 83117.69 84749.77 87284.43 91942.33
## 2017 2018 2019 2020 2021 cluster
## 1 2707.01 2954.85 3165.02 2996.28 3390.40 4
## 2 5640.06 6327.54 6789.07 6628.26 7646.74 4
## 3 3211.98 3545.66 3800.10 3685.89 4120.24 4
## 4 8717.79 8960.75 9222.79 8948.33 9486.83 4
## 5 58207.93 60471.34 63009.70 59608.74 63973.78 1
## 6 92880.63 97152.54 99781.12 96063.91 102720.52 1
fviz_cluster(kmeans, data = X,
palette = c("YlGnBu"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
As we can observe, the singleton cluster (cluster 2 in this run) corresponds to Paris (FR10), which is also visible in the following graph where each region is labelled.
Both graphs serve different purposes, as they visualize the clusters in different ways: the second one draws ellipses (no ellipse is drawn for the singleton cluster, since it consists of a single data point). It would be interesting to produce the same plot with 3 or 2 clusters to see what happens to FR10 (whether it remains a single-individual cluster).
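That check can be sketched directly (the seed and nstart are additions for reproducibility, not part of the original code):

```r
# Re-run KMeans with k = 3 and k = 2 and check whether FR10 stays alone
set.seed(123)
for (k in c(3, 2)) {
  km_k <- kmeans(scale(X), k, nstart = 25)
  fr10_cluster <- km_k$cluster[regions == "FR10"]
  cat("k =", k, "- size of FR10's cluster:", sum(km_k$cluster == fr10_cluster), "\n")
}
```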
fviz_cluster(kmeans, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=regions,hjust=0, vjust=0,size=2,check_overlap = F)+scale_fill_brewer(palette="Paired")
## Too few points to calculate an ellipse
If we plot the clusters on the map:
tgs00003_shp123 <- tgs0003cluster |>
select(geo, cluster) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
tgs00003_shp123 |>
ggplot(aes(fill = factor(cluster))) + # Convert 'cluster' to a factor
geom_sf(size = 0.1, color = "#333333") +
scale_fill_manual( # Use scale_fill_manual for factor variables
values = brewer_pal(palette = "Paired")(length(unique(tgs00003_shp123$cluster))),
name = "Kmeans Cluster",
na.value = "gray80",
guide = guide_legend(title.position = "top",
label.position = "right")
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = "by NUTS 2 regions - million EUR",
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
This map allows us to understand where each cluster is located; it would be interesting to mark the capital cities on the map to see the influence they have on nearby regions. Another thing to consider is that the cluster "legend" does not give any extra information: a cluster being number 1 does not mean it has a higher regional gross domestic product (as we have observed, cluster 2 is the one with the highest values), and if we ran the algorithm again the numbering might change, because KMeans has a "random" component in where the initial centroids are placed. For that reason it would be interesting to recode the tags or give them actual names (mainly in the business sense, if we were working with clients: top clients, one-hit-wonder clients, among other ideas).
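One way to sketch that recoding is to order the clusters by their mean 2021 GDP so the labels stay meaningful across runs (the label names here are our own invention):

```r
# Relabel clusters in increasing order of mean 2021 GDP, so "C1 - low"
# is always the poorest group and "C4 - top" the richest.
means_2021 <- tapply(tgs0003cluster$`2021`, tgs0003cluster$cluster, mean)
ordered_ids <- as.integer(names(sort(means_2021)))
labels <- c("C1 - low", "C2 - mid", "C3 - high", "C4 - top")
tgs0003cluster$cluster_label <- labels[match(tgs0003cluster$cluster, ordered_ids)]
```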
For hierarchical clustering we first need to decide the distance between observations and the linkage criterion used to join groups; here we use Euclidean distance with Ward linkage (a common default) on the scaled data:
hc <- hclust(dist(scale(X)), method = "ward.D2")
hc$labels <- regions
fviz_dend(x = hc,
k=4,
palette = "Set2",
rect = TRUE, rect_fill = TRUE, cex=0.5,
rect_border = "Set2"
)
As we can observe, the result follows the same pattern as the k-means algorithm: there is one single-region cluster, which I assume is again FR10. Given the number of regions, however, this dendrogram is hard to read, so it is worth visualizing it in another way.
fviz_dend(x = hc,
k = 4,
color_labels_by_k = TRUE,
cex = 0.8,
type = "phylogenic",
repel = TRUE)+ labs(title="Regional Gross GDP tree clustering of the world") + theme(axis.text.x=element_blank(),axis.text.y=element_blank())
This type of plot lets us compare the regions within each cluster along the tree, and it clearly confirms the hypothesis that the single-region cluster is still FR10. One issue, again, is that we are looking at region codes, so we do not know exactly which regions they are unless they are plotted on a map.
tgs0003clusterhierarc <- cbind(tgs00003_wide, cluster = cutree(hc, k = 4))
tgs00003_shp123hier <- tgs0003clusterhierarc |>
select(geo, cluster) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
tgs00003_shp123hier |>
ggplot(aes(fill = factor(cluster))) + # Convert 'cluster' to a factor
geom_sf(size = 0.1, color = "#333333") +
scale_fill_manual( # Use scale_fill_manual for factor variables
values = brewer_pal(palette = "Paired")(length(unique(tgs00003_shp123hier$cluster))),
name = "Hierarchical Cluster",
na.value = "gray80",
guide = guide_legend(title.position = "top",
label.position = "right")
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = "by NUTS 2 regions - million EUR",
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
Once again, it would be interesting to recode the labels so the two maps match as closely as possible, making any differences between the clusterings easy to spot, and even to tag each cluster according to its mean value to convey the differences in value between clusters.
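One simple way to make the two maps comparable is to map each hierarchical cluster onto the k-means cluster it overlaps most with. This is a sketch on invented toy labels (a majority-overlap heuristic, not something the project implements):

```r
# Toy labels: two clusterings of the same 8 regions with permuted numbering
km_lab <- c(1, 1, 2, 2, 3, 3, 4, 4)
hc_lab <- c(3, 3, 1, 1, 4, 4, 2, 2)

overlap <- table(hc_lab, km_lab)            # contingency table of the two labelings
mapping <- apply(overlap, 1, which.max)     # each hc label -> best-matching km label
aligned <- as.integer(mapping[as.character(hc_lab)])
aligned                                     # now matches km_lab
```

With real data the overlap table also shows at a glance which regions the two methods disagree on.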
For this specific type of data there is a methodology that seems more suitable than k-means or hierarchical clustering, since those methods ignore the longitudinal nature of the data and treat each year as an independent variable rather than as points of a time series.
This methodology, known as club convergence, was developed by Phillips and Sul (2007). Its main objective is to identify the convergence clubs that may exist in a sample, following the ideas of convergence and transition paths commonly studied in economics.
The following process represents how this algorithm works:
1. Cross-section classification: the countries are ordered in decreasing order, i.e., from highest to lowest, according to their values in the last period.
2. Club formation: starting from the country with the highest value in the last period, we look for the first k such that the log t regression test statistic t_k is greater than -1.65. The test is run for the first two countries; if it is not satisfied, it is run for the second and third, and so on until a pair of countries satisfies it. If no pair of countries does, i.e., no k meets this requirement, there are no convergence subgroups in our data sample.
3. Screening of individuals to create convergence clubs: if club formation found a pair, we repeat the same test, adding countries in the order established in step 1. When the criterion is no longer met, we have our first club.
4. Recursion and stopping rule: a subgroup is formed with the individuals not screened in the previous step. The log t regression test is run on it; if the statistic is greater than -1.65, this subgroup forms another club. Otherwise, the three previous steps are repeated on the subgroup.
Schnurbus et al. (2017) propose a fifth step: merging clubs. The log t regression test is run for clubs 1 and 2 and, if it passes, they are merged; the same is then done for the new club 1 and the next club, and so on until no further mergers are possible, leaving the minimum number of clubs.
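The log t regression test at the heart of these steps can be sketched in a few lines. This is a simplified reading of Phillips and Sul (2007), on a toy convergent panel and with ordinary (not HAC) standard errors; all names are mine:

```r
# log t test sketch: regress log(H_1/H_t) - 2*log(log(t)) on log(t) over the
# last 70% of the sample and compare the slope's t-statistic with -1.65.
Tn <- 30; N <- 5
y <- sapply(1:N, function(i) 1 + i * exp(-0.2 * (1:Tn)))  # toy converging panel
h <- y / rowMeans(y)                                      # relative transition paths
H <- rowMeans((h - 1)^2)                                  # cross-sectional variance

t_idx <- (floor(0.3 * Tn) + 1):Tn                         # trim the first 30%
lhs   <- log(H[1] / H[t_idx]) - 2 * log(log(t_idx))
fit   <- lm(lhs ~ log(t_idx))
tval  <- summary(fit)$coefficients[2, "t value"]
tval > -1.65   # TRUE: the toy panel passes the convergence test
```

In the full methodology the same statistic is computed with heteroskedasticity- and autocorrelation-robust standard errors, which is what the `HACmethod` argument of `findClubs` controls.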
For further details, see Phillips and Sul (2007, 2009) and Du (2017), which are easily accessible; Du's work in particular covers the whole PS methodology, the merge rule, and how to replicate the analysis in Stata. The academic publication is available at: https://journals.sagepub.com/doi/10.1177/1536867X1801700407.
As Stata is not open source, the methodology has also been implemented in R in the ConvergenceClubs package.
The main objective here is to apply this algorithm and analyze the differences with respect to the other clustering methodologies.
df_long <- tgs00003 |>
select(geo, TIME_PERIOD, values)
df_wide <- df_long |>
pivot_wider(names_from = TIME_PERIOD, values_from = values) |>
as.data.frame()
# Log-transform the series and drop the incomplete year 2022
logvalues <- log(df_wide[,-1]) |> select(-"2022")
# Smooth each region's log series with the Hodrick-Prescott filter (lambda = 400)
filteredvalues <- apply(logvalues, 1,
function(x){mFilter::hpfilter(x, freq=400, type="lambda")$trend} )
filteredvalues <- data.frame(Geo = df_wide[,1], t(filteredvalues), stringsAsFactors=FALSE )
df_wide <- df_wide |>
select(-"2022") |>
filter( geo != "HUZZ")
colnames(filteredvalues) <- colnames(df_wide)
filteredvalues <- filteredvalues |>
filter(geo != "HUZZ")
h <- computeH(filteredvalues[,-1], quantity = "h")
df_h <- data.frame(h)
geo_names <- df_wide |>
select(geo)
h_wide <- cbind(geo_names, df_h)
head(h_wide)
## geo X2011 X2012 X2013 X2014 X2015 X2016 X2017
## 1 AL01 0.7731388 0.7756596 0.7781774 0.7807090 0.7832689 0.7858620 0.7884841
## 2 AL02 0.8353647 0.8396425 0.8439148 0.8482014 0.8525162 0.8568663 0.8612499
## 3 AL03 0.8012472 0.8026406 0.8040302 0.8054259 0.8068429 0.8083041 0.8098187
## 4 AT11 0.8953381 0.8963822 0.8974149 0.8984297 0.8994205 0.9003837 0.9013159
## 5 AT12 1.0885294 1.0889850 1.0894364 1.0898807 1.0903137 1.0907333 1.0911365
## 6 AT13 1.1394287 1.1393677 1.1393058 1.1392409 1.1391702 1.1390918 1.1390017
## X2018 X2019 X2020 X2021
## 1 0.7911247 0.7937684 0.7964051 0.7990333
## 2 0.8656523 0.8700545 0.8744455 0.8788209
## 3 0.8113841 0.8129819 0.8145935 0.8162062
## 4 0.9022181 0.9030946 0.9039526 0.9047988
## 5 1.0915217 1.0918888 1.0922408 1.0925836
## 6 1.1389019 1.1387924 1.1386757 1.1385535
h_long <- pivot_longer(h_wide,
starts_with("X"),
names_to = "year",
names_prefix = "X",
values_to = "h")
clubs <- findClubs(df_wide,
dataCols=2:ncol(filteredvalues),
unit_names = 1,
refCol=ncol(filteredvalues),
time_trim =0.3,
cstar=0,
HACmethod = 'FQSB')
mclubs <- mergeClubs(clubs, mergeMethod='vLT', mergeDivergent=FALSE)
plot(mclubs, clubs=NULL, avgTP = TRUE, legend=TRUE, plot_args=list(type='o', xmarks=seq(1,11,1), xlab= "", xlabs=seq(2011,2021,1),xlabs_dir=1), legend_args=list(max_length_labels=10, y.intersp=1, cex= 0.5))
plot(mclubs, clubs=1, avgTP=FALSE, legend=TRUE,
plot_args=list(type='o', xmarks=seq(1,11,1), xlabs=seq(2011,2021,1), xlab="", ylab="h"),
legend_args=list(y.intersp=0.01, cex=0.1))
plot(mclubs, clubs=7, avgTP=FALSE, legend=TRUE,
plot_args=list(type='o', xmarks=seq(1,11,1), xlabs=seq(2011,2021,1), xlab="", ylab="h"),
legend_args=list(y.intersp=0.01, cex=0.1))
summary(mclubs)
## Number of convergence clubs: 15
## Number of divergent units: 4
##
## | merged clubs | # of units | beta | std.err | tvalue
## --------- --------------- ------------- ---------- ---------- ----------
## club1 | clubs: 1 | 19 | -0.218 | 0.095 | -2.309
## club2 | clubs: 2 | 13 | -0.027 | 0.209 | -0.13
## club3 | clubs: 3 | 10 | 0.13 | 0.314 | 0.415
## club4 | clubs: 4 | 18 | -0.894 | 0.559 | -1.599
## club5 | clubs: 5, 6 | 21 | 0.063 | 0.186 | 0.341
## club6 | clubs: 7 | 29 | 0.207 | 0.249 | 0.833
## club7 | clubs: 8 | 14 | 0.289 | 0.219 | 1.318
## club8 | clubs: 9 | 55 | -0.241 | 0.168 | -1.437
## club9 | clubs: 10 | 52 | -0.412 | 0.117 | -3.51
## club10 | clubs: 11 | 36 | 0.033 | 0.24 | 0.137
## club11 | clubs: 12 | 17 | 0.029 | 0.221 | 0.132
## club12 | clubs: 13 | 9 | 0.376 | 0.219 | 1.716
## club13 | clubs: 14 | 5 | -1.136 | 0.81 | -1.403
## club14 | clubs: 15 | 2 | 3.335 | 1.493 | 2.234
## club15 | clubs: 16 | 2 | 2.61 | 1.565 | 1.667
print(mclubs)
## ========================================================================
## club 1
## ------------------------------------------------------------------------
## FRK2, IE06, ES30, ES51, DEA1, DE11, DE71, TR10, DEA2, ITI4, FRL0, NL32,
## NL33, SE11, ITH5, ITH3, DE30, ES61, IE05
##
## beta: -0.2184
## std.err: 0.0946
## tvalue: -2.3091
## pvalue: 0.0105
##
## ========================================================================
## club 2
## ------------------------------------------------------------------------
## DK01, CH04, ITC1, CH02, DEA5, DE12, NL41, NO08, DE60, CH01, FRG0, FRE1,
## PL91
##
## beta: -0.0271
## std.err: 0.2089
## tvalue: -0.1296
## pvalue: 0.4484
##
## ========================================================================
## club 3
## ------------------------------------------------------------------------
## ITI1, ES52, FRI1, ITF3, FRH0, DEF0, AT13, BE21, FI1B, DE94
##
## beta: 0.1301
## std.err: 0.3139
## tvalue: 0.4146
## pvalue: 0.6608
##
## ========================================================================
## club 4
## ------------------------------------------------------------------------
## NOZZ, SE23, FRJ2, CH03, DEA3, DE13, ITG1, DEB3, BE10, NL22, DE14, DE92,
## EL30, CH05, DE25, DEA4, NL31, NO0A
##
## beta: -0.8939
## std.err: 0.5589
## tvalue: -1.5994
## pvalue: 0.0549
##
## ========================================================================
## club 5
## ------------------------------------------------------------------------
## FRJ1, DE40, DE27, ITF4, DE91, FRB0, RO32, CZ01, SE12, PT17, LU00, ES21,
## DK04, AT31, SE22, PL22, DEE0, DEG0, CH06, PT11, FRF1
##
## beta: 0.0634
## std.err: 0.1859
## tvalue: 0.3408
## pvalue: 0.6334
##
## ========================================================================
## club 6
## ------------------------------------------------------------------------
## AT12, ES11, TR51, BE23, DK03, FRF3, ES41, FRD2, PL41, DE26, HU11, FI19,
## BE24, FRI3, DEB1, DED2, DE22, DE93, BE25, FRE2, AT22, DE23, NL42, DE80,
## NL21, PL51, FI1D, PL21, DE73
##
## beta: 0.2073
## std.err: 0.249
## tvalue: 0.8327
## pvalue: 0.7975
##
## ========================================================================
## club 7
## ------------------------------------------------------------------------
## ITC3, FRC1, FI1C, DE24, DED4, ES42, PT16, DED5, BG41, CZ06, PL71, PL63,
## IE04, RO11
##
## beta: 0.2887
## std.err: 0.219
## tvalue: 1.3183
## pvalue: 0.9063
##
## ========================================================================
## club 8
## ------------------------------------------------------------------------
## TR42, TR31, ITI3, ES70, FRD1, FRK1, ITH4, FRF2, SE21, DE72, TR41, ES24,
## BE32, DEC0, NO09, SE31, ITG2, DE50, BE33, AT33, ITF6, ITF1, LV00, DK02,
## ES62, LT02, FRC2, BE22, EE00, SK02, CH07, PL92, ES53, SI04, AT32, SE33,
## SK01, DK05, RO31, CZ05, RO12, TR62, CZ02, NL11, NO06, NO07, ITH1, RO21,
## PL61, CY00, EL52, LT01, TR61, RO22, ES12
##
## beta: -0.2412
## std.err: 0.1679
## tvalue: -1.4367
## pvalue: 0.0754
##
## ========================================================================
## club 9
## ------------------------------------------------------------------------
## ITI2, NL12, CZ03, SI03, PL82, AT21, TR32, CZ07, RO42, TR33, ITH2, BE31,
## RS11, PL81, SK04, PL42, CZ08, ES43, AT34, ES22, FRY4, FRI2, SK03, TR21,
## HR05, TR63, HR03, RO41, NO02, DEB2, TRC1, HU12, NL13, SE32, CZ04, TR52,
## NL23, HU21, MT00, TR72, NL34, PL62, HU32, BE35, HU22, HU33, ES13, RS12,
## PT18, PL72, PL84, HU31
##
## beta: -0.4123
## std.err: 0.1175
## tvalue: -3.5102
## pvalue: 2e-04
##
## ========================================================================
## club 10
## ------------------------------------------------------------------------
## TR83, ITF5, TR90, TR22, PL43, MK00, PL52, TRC2, HR02, FRM0, RS21, TRC3,
## FRY1, PT15, BG42, HR06, AT11, EL61, HU23, EL64, TR71, FRY2, EL43, ES23,
## TRB1, BG34, EL65, RS22, EL63, BE34, AL02, BG33, EL51, ITF2, TRB2, BG32
##
## beta: 0.0328
## std.err: 0.2396
## tvalue: 0.137
## pvalue: 0.5545
##
## ========================================================================
## club 11
## ------------------------------------------------------------------------
## TR81, EL42, TRA1, PT30, BG31, ME00, ITC2, TR82, PT20, FRY3, AL03, EL54,
## TRA2, EL53, AL01, FRY5, EL62
##
## beta: 0.0292
## std.err: 0.2212
## tvalue: 0.1319
## pvalue: 0.5525
##
## ========================================================================
## club 12
## ------------------------------------------------------------------------
## NLZZ, EL41, DKZZ, ITZZ, ES63, ES64, FI20, ESZZ, FRZZ
##
## beta: 0.3763
## std.err: 0.2193
## tvalue: 1.716
## pvalue: 0.9569
##
## ========================================================================
## club 13
## ------------------------------------------------------------------------
## BEZZ, NO0B, PTZZ, ROZZ, ATZZ
##
## beta: -1.1358
## std.err: 0.8096
## tvalue: -1.4029
## pvalue: 0.0803
##
## ========================================================================
## club 14
## ------------------------------------------------------------------------
## SEZZ, FIZZ
##
## beta: 3.3349
## std.err: 1.493
## tvalue: 2.2336
## pvalue: 0.9872
##
## ========================================================================
## club 15
## ------------------------------------------------------------------------
## RSZZ, MTZZ
##
## beta: 2.6097
## std.err: 1.5651
## tvalue: 1.6675
## pvalue: 0.9523
##
## ========================================================================
## divergent
## ------------------------------------------------------------------------
## FR10, ITC4, DE21, LVZZ
tp <- transition_paths(clubs, output_type = 'data.frame')
In case we want to plot other clubs, we can use:
plot(mclubs, clubs=1, avgTP=FALSE, legend=TRUE, plot_args=list(type='o', xmarks=seq(1,11,1), xlabs=seq(2011,2021,1), xlab="", ylab="h"), legend_args=list(y.intersp=0.01, cex=0.1))
and change the parameter clubs= accordingly.
merged_df <- merge(tgs00003_wide, tp, by.x = "geo", by.y = "unit_name", all.x = TRUE)
tgs00003_shpmerged_df <- merged_df |>
select(geo, club) |>
right_join(SHP_2_3035, by = "geo") |>
st_as_sf()
tgs00003_shpmerged_df$club <- factor(tgs00003_shpmerged_df$club, levels = c("club1", "club2", "club3", "club4", "club5", "club6", "club7", "club8", "club9", "club10", "club11", "club12", "club13", "club14", "club15", "divergent"))
tgs00003_shpmerged_df |>
ggplot(aes(fill = factor(club))) + # Convert 'cluster' to a factor
geom_sf(size = 0.1, color = "#333333") +
scale_fill_manual( # Use scale_fill_manual for factor variables
values = c("dodgerblue2", "#E31A1C", "green4","#6A3D9A", "#FF7F00", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6", "#FDBF6F", "khaki2", "maroon", "orchid1","steelblue4", "black", "yellow4")
,
name = "Club Convergence Cluster",
na.value = "gray80",
guide = guide_legend(title.position = "top",
label.position = "right")
) +
scale_x_continuous(limits = c(2500000, 7000000)) +
scale_y_continuous(limits = c(1600000, 5200000)) +
labs(
title = "Regional gross domestic product",
subtitle = "by NUTS 2 regions - million EUR",
caption = "Data: Eurostat tgs00003"
) +
theme_void() +
theme(legend.position = c(0.94, 0.70))
To conclude the analysis, we have seen that the way we construct our database can change the results, and that k-means and other common clustering algorithms do not seem well suited to this type of data compared with the club convergence algorithm, which is more appropriate in most cases. It is worth noting that the two methodologies could be combined: for example, take several time series per region, perform PCA on each, combine the first components, cluster the resulting data, run a simple classification algorithm to obtain feature importances and build an "index", and finally run the club convergence algorithm on that index. Such a workflow would let us analyze not just one variable over a period but several variables at once (GDP, unemployment rate, and so on) when clustering regions across the European Union.
Analyzing clusters can help policy makers decide where to focus financial or other types of aid. The club convergence labels also make clubs easy to compare (club 1 is better off than club 2, and so on), meaning that lower-ranked clubs need more aid than higher-ranked ones. K-means, by contrast, requires extra care with its labels because, as explained in the methodology, its results depend on the random placement of the initial centroids.
It is interesting to compare the algorithms, and this project served as a starting point for doing so, explaining the advantages and disadvantages of each and their respective purposes and uses. As outlined above, these methods could also be combined to analyze several variables over time when clustering countries.
The data set analyzed may not be the perfect example for studying convergence clubs, since we should arguably account for the population of each region to correct for differences in size; this can be done easily by manipulating the original data set.
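That per-capita adjustment amounts to a join and a division. A hedged sketch on invented numbers (the toy frames stand in for tgs00003 and a Eurostat population-by-NUTS-2 table; all figures are made up for illustration):

```r
# Toy GDP (million EUR) and population tables keyed by region and year
gdp <- data.frame(geo = c("AT11", "AT12"), TIME_PERIOD = 2021,
                  values = c(9000, 60000))
pop <- data.frame(geo = c("AT11", "AT12"), TIME_PERIOD = 2021,
                  population = c(297000, 1700000))

# Join on region and year, then convert million EUR to EUR per inhabitant
per_capita <- merge(gdp, pop, by = c("geo", "TIME_PERIOD")) |>
  transform(gdp_per_capita = values * 1e6 / population)
per_capita$gdp_per_capita
```

Running any of the clustering methods on `gdp_per_capita` instead of `values` would compare regions net of their size.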
All the code is easily replicable: no local data set has been used, everything relies on publicly available libraries, and the data come from Eurostat. Each clustering method serves its own purpose and has its own strengths, so it is worth trying as many as possible, since depending on the context one will be more suitable than the others.